This report explores which chemical properties influence the quality of red wines.

Univariate Plots Section

The report explores a dataset containing a quality score and 11 chemical features for 1,599 red wine observations.

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

The counts above give the number of NA values in each column of the data frame. No values are missing.
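A minimal sketch of such a missing-value check. The report's own loading code is not shown, so the data frame name `wines` is an assumption, and it holds only the first ten rows of three columns copied from the structure listing above.

```r
# Hypothetical data frame: first ten rows of three columns,
# copied from the str() output above (not the full 1,599-row dataset).
wines <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(wines))
print(na_counts)

# A guard that stops with an error if any value is missing.
stopifnot(!anyNA(wines))
```

Counting per column rather than calling `anyNA()` alone shows which variable would need attention if a gap did appear.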

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632

The correlation of each variable with quality is shown above. Four attributes have a weak to moderate correlation (positive or negative) with quality: volatile.acidity, citric.acid, sulphates, and alcohol.
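One way a single-column table of this kind could be produced is sketched below. It assumes a data frame named `wines` (a hypothetical name) and uses only the first ten rows of three features from the structure listing, so the numbers it prints will not match the full-data values reported above.

```r
# Hypothetical data frame: first ten rows copied from the str() output above.
wines <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of every feature column with quality.
# cor(data.frame, vector) returns a one-column matrix with features as row names.
features    <- wines[, names(wines) != "quality"]
quality_cor <- cor(features, wines$quality)
print(quality_cor)

# Order the features by absolute correlation, strongest first.
ranked <- quality_cor[order(abs(quality_cor[, 1]), decreasing = TRUE), , drop = FALSE]
print(ranked)
```

Ranking by absolute value keeps strong negative relationships, such as the one for volatile acidity in the full data, from being overlooked.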

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Correlations between all variables.

A quick matrix chart showing some of the relationships between variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The quality scores follow a roughly normal distribution. Although the scale runs from 0 to 10, no wine scored below 3 or above 8, and most wines scored 5 or 6.
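Because quality takes only a handful of integer values, a frequency table is a quick complement to the summary statistics. The sketch below uses the first ten quality scores from the structure listing as a stand-in for the full column, so its counts are illustrative only.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Tabulate against the full 0-10 scale so unused scores show up as zero counts.
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Share of wines scoring 5 or 6.
share_5_or_6 <- mean(quality %in% c(5L, 6L))
print(share_5_or_6)
```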

Explore each variable individually and the correlation with quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] 0.9809084

Since the fixed acidity is positively skewed, we’ll try some transforms.

Fixed acidity appears to be positively skewed in all charts, but the log transformation gives the distribution closest to normal.

## [1] 0.1142376

The log-transformed attribute correlates slightly worse with quality (0.114) than the untransformed attribute (0.124).
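The transform comparison used for each variable can be sketched as follows. The report does not say which skewness function it used, so the moment-based `skewness()` helper here is an assumption, as are the variable names. The data are the first ten rows from the structure listing, so the printed values will differ from the full-data figures.

```r
# Moment-based sample skewness (an assumed definition; packages such as
# e1071 or moments offer variants that may differ slightly).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten rows, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality       <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Candidate transforms compared in this report.
candidates <- list(
  raw         = fixed_acidity,
  sqrt        = sqrt(fixed_acidity),
  fourth_root = fixed_acidity^0.25,
  log         = log(fixed_acidity),
  reciprocal  = 1 / fixed_acidity
)

# Skewness of each candidate and its Pearson correlation with quality.
skew_by_transform <- sapply(candidates, skewness)
cor_by_transform  <- sapply(candidates, cor, y = quality)
print(round(rbind(skewness = skew_by_transform,
                  cor_with_quality = cor_by_transform), 3))
```

A skewness near zero suggests a more symmetric distribution, but a more symmetric predictor does not necessarily correlate more strongly with quality, which matches the result above.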

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] 0.6703331

Volatile acidity is positively skewed.

The fourth root (the square root of the square root) appears to give the distribution closest to normal.

## [1] -0.3934108

However, it barely strengthens the correlation with quality (-0.393 versus -0.391 untransformed).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## [1] 0.3177403

Citric acid appears to be positively skewed, but the jump around 0.5 reduces the skewness measure.

Of all the transforms, the square root gives the distribution closest to normal, but it still leaves a large number of wines with almost no citric acid.

## [1] 0.2066822

The square root actually lowers the correlation with quality (0.207 versus 0.226 untransformed).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## [1] 4.53214

Residual sugar has an extreme positive skewness.

The reciprocal transformation appears to bring residual sugar closest to a normal distribution.

## [1] -0.02898281

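While this roughly doubles the magnitude of the correlation (0.014 to -0.029), with the sign flipping because the reciprocal reverses the ordering of values, the correlation remains negligible.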

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] 5.669694

Chlorides also has an extreme positive skewness.

The fourth root (the square root of the square root) brings chlorides closest to a normal distribution.

## [1] -0.1656209

The fourth root gives only a slightly stronger correlation with quality (-0.166 versus -0.129 untransformed).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## [1] 1.248222

Free sulfur dioxide has a positive skewness.

Log transform creates the nicest normal distribution.

## [1] -0.05008749

But the correlation remains almost unchanged.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## [1] 1.512689

Total sulfur dioxide has a positive skewness.

Log transform creates a fairly normal distribution.

## [1] -0.1701427

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
## [1] 0.07115397

Density shows a fairly normal distribution so no transforms will be performed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## [1] 0.1933203

pH shows a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## [1] 2.424118

Sulphates are highly positively skewed.

The reciprocal transform gives the distribution closest to normal.

## [1] -0.3403317

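This increased the magnitude of the correlation (0.340 versus 0.251 untransformed). The sign turned negative because the reciprocal reverses the ordering of values. We’ll explore both options.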
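The sign change is a property of the reciprocal itself: for positive values, 1/x reverses the ordering, so a rank-based (Spearman) correlation is exactly negated, and the Pearson correlation usually changes sign as well. The sketch below illustrates this on the first ten rows from the structure listing. In this small sample the raw correlation differs in value and sign from the full-data figure, so only the sign flip is the point here.

```r
# First ten rows, copied from the str() output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Spearman works on ranks; the reciprocal reverses the ranks of positive
# values, so the coefficient is exactly negated.
rho_raw   <- cor(sulphates, quality, method = "spearman")
rho_recip <- cor(1 / sulphates, quality, method = "spearman")
print(c(raw = rho_raw, reciprocal = rho_recip))

# Pearson is not exactly negated, because the reciprocal is nonlinear,
# but its sign flips in this sample too.
r_raw   <- cor(sulphates, quality)
r_recip <- cor(1 / sulphates, quality)
print(c(raw = r_raw, reciprocal = r_recip))
```

When comparing a reciprocal transform against the raw attribute, it is therefore the absolute value of the correlation that matters, not its sign.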

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] 0.8592144

Alcohol has a positive skew.

The square root transform gives the distribution closest to normal.

## [1] 0.4768205

No real change in correlation (0.477 versus 0.476 untransformed), but we’ll plot both.

Bivariate Plots Section

Fixed acidity: the very slight positive correlation can be seen in the trendline.

Volatile acidity: the negative correlation is obvious in the trendline.

Citric acid: the weak positive correlation can be seen in the trendline.

Residual sugar: the trendline is almost completely flat, showing no real correlation.

Chlorides: the negative trendline shows the slight correlation.

Free sulfur dioxide: the fairly flat trendline shows the lack of correlation.

Total sulfur dioxide: the weak correlation between total sulfur dioxide and quality can be seen in the trendline.

Density: the slight negative correlation can be seen in the trendline.

pH: the trendline shows almost no correlation.

Sulphates: the reciprocal sulphates show a much stronger negative trend with quality.

Alcohol: alcohol and its square root both appear to have the strongest trendlines we’ve seen with quality.

Now we’ll see how the four attributes with the highest correlations with quality correlate with each other.

## [1] -0.5524957

There is a fairly strong negative correlation between volatile acidity and citric acid, but the two may be correlated merely because both are acids in the wine.

## [1] -0.2609867

Weak correlation between volatile acidity and sulphates.

## [1] -0.202288

Weak correlation between volatile acidity and alcohol.

## [1] 0.31277

Medium correlation between citric acid and sulphates.

## [1] 0.1099032

Very little correlation between citric acid and alcohol.

## [1] 0.09359475

Very little correlation between sulphates and alcohol.
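The six pairwise checks above can also be run in one pass. The sketch below assumes a hypothetical data frame `wines` holding the first ten rows of the four attributes from the structure listing, so its values will not match the full-data correlations reported above.

```r
# Hypothetical data frame: first ten rows copied from the str() output above.
wines <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Full Pearson correlation matrix of the four attributes.
pairwise <- cor(wines)

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(pairwise), arr.ind = TRUE)
pairs <- data.frame(
  a = rownames(pairwise)[idx[, "row"]],
  b = colnames(pairwise)[idx[, "col"]],
  r = pairwise[idx]
)
pairs <- pairs[order(abs(pairs$r), decreasing = TRUE), ]
print(pairs, row.names = FALSE)
```

Sorting by absolute correlation puts the most redundant pair at the top, which in the full data is volatile acidity and citric acid.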

Bivariate Analysis

Multivariate Plots Section

Multivariate Analysis

Final Plots and Summary

Plot One

Description One

Alcohol had the highest positive correlation with wine quality. This makes sense, as one of the primary reasons to have an alcoholic beverage in the first place is the alcohol. Among wines rated around 7, the vast majority have an alcohol content above 10%.

Plot Two

Description Two

Sulphates had the second-highest positive correlation with quality. Sulphates are wine additives that act as antimicrobial and antioxidant agents. They preserve the wine, so a higher sulphate level may make it less likely that a tasted wine had gone bad.

Plot Three

Description Three

This shows little correlation between sulphates and alcohol, the two attributes most positively correlated with wine quality in the dataset. Since we’re ultimately trying to find the attributes that influence wine quality, and possibly to predict quality from them, it’s important that the features are not redundant. Redundant (highly correlated) attributes add little new information and can make a model's estimates unstable.

Reflections

As wine quality was effectively a categorical variable consisting mostly of values of 5 or 6, these values strongly influenced the appearance of the graphs plotting attributes against quality. I was hoping that some of the transforms would give a higher correlation with quality than the untransformed attribute, but I didn’t see any real evidence of this with the transforms I created.

Some limitations are due to the volume of data. 1,599 records is not a large dataset; perhaps I should have chosen the white wines instead. To investigate the data further, I would like to see a larger set. In addition, while the quality measure was the median of scores from three wine experts, I would also like to see the mean, which would give a more continuous quality measure.